Data 570: Predictive Modelling
Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
1 + 1
[1] 2
You can add options to executable code like this:
[1] 4
The echo: false option disables the printing of code (only output is displayed).
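The chunk that printed `[1] 4` is hidden in the rendered page, so here is a minimal sketch of what such a chunk might look like. The expression `2 * 2` is an assumption (the source only shows the output); in the `.qmd` file the fence header would be `{r}`, and the `#|` line is how Quarto reads chunk options.

```r
#| echo: false
# With echo: false, Quarto hides this source and shows only the printed result.
# The expression below is an assumed stand-in for the hidden code.
result <- 2 * 2
result
```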
Hello! My name is RUOCEHNYANG, and I am a first-year master’s student in Data Science at UBC.
This picture was taken at Granville Island in Vancouver.
install.packages("tidyverse", repos = "https://cran.rstudio.com/")
Installing package into 'C:/Users/magic/AppData/Local/R/win-library/4.5'
(as 'lib' is unspecified)
package 'tidyverse' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\magic\AppData\Local\Temp\Rtmp4wNzSl\downloaded_packages
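Because `install.packages()` sits in the document, the package is downloaded again on every render. One possible way around that is to install only when the package is missing. The helper name `ensure_installed` is hypothetical, and the demo call uses `stats` (which ships with R) so that nothing is downloaded here.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_installed <- function(pkg, repos = "https://cran.rstudio.com/") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # Report whether the package can be loaded after the check.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never triggers a download.
already_there <- ensure_installed("stats")

# In this report the call would be:
# ensure_installed("tidyverse")
```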
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read CSV file (adjust the path if needed)
sales <- read_csv("sales_data.csv")
Rows: 30 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Order ID, Product Name, Category, Customer ID, Customer Gender, Pa...
dbl (4): Price, Quantity Sold, Total Sales, Customer Age
date (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
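The message above suggests declaring the column types. A small sketch of how that could look is below. To keep it runnable without `sales_data.csv`, it reads two rows copied from the `head(sales)` output in this report, restricted to six columns; the names `csv_text`, `sales_types`, and `sales_small` are made up for the example.

```r
library(readr)

# Two rows copied from the head(sales) output, six columns only.
# I() tells read_csv to treat the string as literal data, not a file path.
csv_text <- I(paste(
  "Date,Order ID,Product Name,Category,Price,Quantity Sold",
  "2023-01-01,ORD1001,Smartphone,Mobile,300,1",
  "2023-01-02,ORD1002,Laptop,Computers,900,2",
  sep = "\n"
))

# Declare the types up front so readr does not guess (and prints no message).
sales_types <- cols(
  Date = col_date(format = ""),
  `Order ID` = col_character(),
  `Product Name` = col_character(),
  Category = col_character(),
  Price = col_double(),
  `Quantity Sold` = col_double()
)

sales_small <- read_csv(csv_text, col_types = sales_types)
```

For the real file the same `col_types` argument would go into `read_csv("sales_data.csv", ...)` with the remaining six columns added; passing `show_col_types = FALSE` instead only silences the message.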
# Preview the first few rows
head(sales)
# A tibble: 6 × 12
  Date       `Order ID` `Product Name` Category    Price `Quantity Sold`
  <date>     <chr>      <chr>          <chr>       <dbl>           <dbl>
1 2023-01-01 ORD1001    Smartphone     Mobile      300.                1
2 2023-01-02 ORD1002    Laptop         Computers   900.                2
3 2023-01-03 ORD1003    Headphones     Accessories  50.0               3
4 2023-01-04 ORD1004    Tablet         Mobile      200.                1
5 2023-01-05 ORD1005    Smartphone     Mobile      300.                2
6 2023-01-06 ORD1006    Smartwatch     Accessories 150.                1
# ℹ 6 more variables: `Total Sales` <dbl>, `Customer ID` <chr>,
# `Customer Age` <dbl>, `Customer Gender` <chr>, `Payment Method` <chr>,
# `Store Location` <chr>
# Display structure of the dataset
str(sales)
spc_tbl_ [30 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Date : Date[1:30], format: "2023-01-01" "2023-01-02" ...
$ Order ID : chr [1:30] "ORD1001" "ORD1002" "ORD1003" "ORD1004" ...
$ Product Name : chr [1:30] "Smartphone" "Laptop" "Headphones" "Tablet" ...
$ Category : chr [1:30] "Mobile" "Computers" "Accessories" "Mobile" ...
$ Price : num [1:30] 300 900 50 200 300 ...
$ Quantity Sold : num [1:30] 1 2 3 1 2 1 1 2 1 2 ...
$ Total Sales : num [1:30] 300 1800 150 200 600 ...
$ Customer ID : chr [1:30] "CUST500" "CUST501" "CUST502" "CUST503" ...
$ Customer Age : num [1:30] 34 29 42 38 25 31 27 40 35 33 ...
$ Customer Gender: chr [1:30] "Female" "Male" "Non-binary" "Female" ...
$ Payment Method : chr [1:30] "Credit Card" "Cash" "PayPal" "Credit Card" ...
$ Store Location : chr [1:30] "New York" "Los Angeles" "Chicago" "New York" ...
- attr(*, "spec")=
.. cols(
.. Date = col_date(format = ""),
.. `Order ID` = col_character(),
.. `Product Name` = col_character(),
.. Category = col_character(),
.. Price = col_double(),
.. `Quantity Sold` = col_double(),
.. `Total Sales` = col_double(),
.. `Customer ID` = col_character(),
.. `Customer Age` = col_double(),
.. `Customer Gender` = col_character(),
.. `Payment Method` = col_character(),
.. `Store Location` = col_character()
.. )
- attr(*, "problems")=<externalptr>
# Remove duplicate rows
sales <- distinct(sales)
# Check for missing values
colSums(is.na(sales))
           Date        Order ID    Product Name        Category           Price
              0               0               0               0               0
  Quantity Sold     Total Sales     Customer ID    Customer Age Customer Gender
              0               0               0               0               0
 Payment Method  Store Location
              0               0
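Note that `distinct(sales)` only drops rows that are identical in every column. If the same order were entered twice with a different value somewhere, it would survive. A rough sketch of an extra check on `Order ID` is below; the third row is hypothetical and repeats `ORD1001` on purpose so the check has something to find.

```r
library(dplyr)

# First two rows use values shown in this report; the third row is
# hypothetical: it repeats ORD1001 with a different Total Sales value.
orders <- tibble(
  `Order ID` = c("ORD1001", "ORD1002", "ORD1001"),
  `Total Sales` = c(300, 1800, 600)
)

# distinct() keeps all three rows because no two rows match in every column.
deduped <- distinct(orders)

# Counting by key shows which Order IDs still appear more than once.
repeated_ids <- deduped %>%
  count(`Order ID`, name = "n_rows") %>%
  filter(n_rows > 1)
```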
sales_ny <- filter(sales, `Store Location` == "New York")
head(sales_ny)
# A tibble: 6 × 12
  Date       `Order ID` `Product Name` Category    Price `Quantity Sold`
  <date>     <chr>      <chr>          <chr>       <dbl>           <dbl>
1 2023-01-01 ORD1001    Smartphone     Mobile      300.                1
2 2023-01-04 ORD1004    Tablet         Mobile      200.                1
3 2023-01-07 ORD1007    Laptop         Computers   800.                1
4 2023-01-10 ORD1010    Headphones     Accessories  60.0               2
5 2023-01-12 ORD1012    Laptop         Computers   950.                1
6 2023-01-15 ORD1015    Headphones     Accessories  90.0               1
# ℹ 6 more variables: `Total Sales` <dbl>, `Customer ID` <chr>,
# `Customer Age` <dbl>, `Customer Gender` <chr>, `Payment Method` <chr>,
# `Store Location` <chr>
sales_ny %>%
  group_by(Date) %>%
  summarise(total = sum(`Total Sales`, na.rm = TRUE)) %>%
  arrange(desc(total)) %>%
  slice(1)
# A tibble: 1 × 2
  Date       total
  <date>     <dbl>
1 2023-01-27 2400.
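The `arrange(desc(total))` plus `slice(1)` pair can also be written with `slice_max()`, which makes the handling of ties explicit. This is only an alternative sketch, not a change to the result above. It runs on the first four orders, using the dates, totals, and store locations printed by `str(sales)`; `first_rows` and `top_day` are names invented here.

```r
library(dplyr)

# First four orders, with values as printed by str(sales) in this report.
first_rows <- tibble(
  Date = as.Date(c("2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04")),
  `Total Sales` = c(300, 1800, 150, 200),
  `Store Location` = c("New York", "Los Angeles", "Chicago", "New York")
)

# Daily totals for New York, keeping the single best day.
# with_ties = FALSE returns exactly one row even if two days tie.
top_day <- first_rows %>%
  filter(`Store Location` == "New York") %>%
  group_by(Date) %>%
  summarise(total = sum(`Total Sales`, na.rm = TRUE)) %>%
  slice_max(total, n = 1, with_ties = FALSE)
```

Leaving `with_ties` at its default of `TRUE` would return every day that shares the top total, which `slice(1)` silently hides.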
sales %>%
  count(`Payment Method`, sort = TRUE)
# A tibble: 3 × 2
  `Payment Method`     n
  <chr>            <int>
1 Credit Card         13
2 Cash                 9
3 PayPal               8
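Counts are easier to compare as shares of all 30 orders. A short sketch, starting from the three counts printed above (the `share` column and the object names are additions for illustration):

```r
library(dplyr)

# Counts copied from the count() output above.
payment_counts <- tibble(
  `Payment Method` = c("Credit Card", "Cash", "PayPal"),
  n = c(13L, 9L, 8L)
)

# Each method's share of all orders (13/30, 9/30, 8/30).
payment_share <- payment_counts %>%
  mutate(share = n / sum(n))
```

On the full data the same `mutate()` step could be appended directly to `count(`Payment Method`, sort = TRUE)`.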
Visualization
ggplot(sales, aes(x = `Customer Age`)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(x = "Customer Age", y = "Count") +
  theme_minimal()
ggplot(sales, aes(x = `Quantity Sold`, y = Price)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Relationship between Quantity and Price",
    x = "Quantity Sold",
    y = "Price"
  ) +
  theme_minimal()
As the scatter plot above shows, there appears to be a slight negative relationship between quantity sold and price: as the quantity increases, the price tends to decrease.
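A reading taken from a scatter plot can be backed up with a rank correlation. The sketch below only shows the call: it uses the first five `Quantity Sold` and `Price` values printed by `str(sales)`, which is far too few to support any conclusion, so no result is quoted here. The names `first_five` and `rho` are made up for the example.

```r
# First five values as printed by str(sales); a stand-in for the full data.
first_five <- data.frame(
  quantity_sold = c(1, 2, 3, 1, 2),
  price = c(300, 900, 50, 200, 300)
)

# Spearman's rho measures a monotone relationship and tolerates the many
# tied quantities (1, 2, 3) better than a reading by eye.
rho <- cor(first_five$quantity_sold, first_five$price, method = "spearman")
```

On the full dataset the equivalent call would be `cor(sales$`Quantity Sold`, sales$Price, method = "spearman")`; a value close to zero would suggest that the apparent trend is weak.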